Name - Surname = Ozan Can Demir
Student Number = 402533
Department = Data Science and Business Analytics
Project = Personal Loan Analysis
The purpose of this project is to perform data analysis and extract meaningful insights and valuable information from a consumer dataset. I will use the Personal Loan Analysis dataset to analyze variables such as income, education, family size, and experience, and the correlations between them. We will also show how these variables affect the consumer loan acceptance process, and identify the customer profile most likely to accept a personal loan offer, based on the customer's relationship with the bank across the features given in the dataset.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import warnings
warnings.filterwarnings(action="ignore")
plt.style.use('ggplot')
%matplotlib inline
The Data which is used in my project was downloaded from the source: Data
The data contains information on 5000 customers: customer demographic information (age, income, etc.), the customer's relationship with the bank (mortgage, securities account, etc.), and the customer's response to the last personal loan campaign (Personal Loan).
df = pd.read_csv('C:\\Users\\ozanc\\Desktop\\Data_Analysis_Project\\Bank_Personal_Loan_Modelling.csv')
df.columns
Index(['ID', 'Age', 'Experience', 'Income', 'ZIP Code', 'Family', 'CCAvg',
'Education', 'Mortgage', 'Personal Loan', 'Securities Account',
'CD Account', 'Online', 'CreditCard'],
dtype='object')
Let's first rename the columns "CD Account" and "CreditCard" so the names contain no spaces:
df.rename(columns={
"CD Account":"CD_Account",
"CreditCard":"Credit_Card"},
inplace=True)
Now let's remove ID and ZIP Code, since we will not be using them in our analysis:
df.drop(['ID','ZIP Code'],axis=1,inplace=True)
df.head()
| Age | Experience | Income | Family | CCAvg | Education | Mortgage | Personal Loan | Securities Account | CD_Account | Online | Credit_Card | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | 1 | 49 | 4 | 1.6 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 45 | 19 | 34 | 3 | 1.5 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 39 | 15 | 11 | 1 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 35 | 9 | 100 | 1 | 2.7 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 35 | 8 | 45 | 4 | 1.0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 |
df.tail()
| Age | Experience | Income | Family | CCAvg | Education | Mortgage | Personal Loan | Securities Account | CD_Account | Online | Credit_Card | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4995 | 29 | 3 | 40 | 1 | 1.9 | 3 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4996 | 30 | 4 | 15 | 4 | 0.4 | 1 | 85 | 0 | 0 | 0 | 1 | 0 |
| 4997 | 63 | 39 | 24 | 2 | 0.3 | 3 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4998 | 65 | 40 | 49 | 3 | 0.5 | 2 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4999 | 28 | 4 | 83 | 3 | 0.8 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
df.sample(5)
| Age | Experience | Income | Family | CCAvg | Education | Mortgage | Personal Loan | Securities Account | CD_Account | Online | Credit_Card | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1925 | 43 | 19 | 81 | 1 | 0.3 | 1 | 218 | 0 | 0 | 0 | 1 | 0 |
| 595 | 42 | 18 | 41 | 1 | 1.8 | 1 | 94 | 0 | 0 | 0 | 1 | 0 |
| 607 | 28 | 3 | 170 | 1 | 0.1 | 3 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2303 | 47 | 21 | 89 | 2 | 0.8 | 3 | 0 | 0 | 0 | 0 | 1 | 1 |
| 1318 | 52 | 26 | 178 | 1 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   Age                 5000 non-null   int64
 1   Experience          5000 non-null   int64
 2   Income              5000 non-null   int64
 3   Family              5000 non-null   int64
 4   CCAvg               5000 non-null   float64
 5   Education           5000 non-null   int64
 6   Mortgage            5000 non-null   int64
 7   Personal Loan       5000 non-null   int64
 8   Securities Account  5000 non-null   int64
 9   CD_Account          5000 non-null   int64
 10  Online              5000 non-null   int64
 11  Credit_Card         5000 non-null   int64
dtypes: float64(1), int64(11)
memory usage: 468.9 KB
df.shape
(5000, 12)
After these changes, our dataset has 5000 observations and 12 variables.
df.isnull().sum()
Age                   0
Experience            0
Income                0
Family                0
CCAvg                 0
Education             0
Mortgage              0
Personal Loan         0
Securities Account    0
CD_Account            0
Online                0
Credit_Card           0
dtype: int64
Let's visualize it:
sns.heatmap(df.isna(), yticklabels=False, cbar=False, cmap='viridis')
<AxesSubplot:>
We have no missing values in our dataset, so we can move forward and start our data analysis.
Let's compute descriptive statistics for our dataset:
df.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| Age | 5000.0 | 45.338400 | 11.463166 | 23.0 | 35.0 | 45.0 | 55.0 | 67.0 |
| Experience | 5000.0 | 20.104600 | 11.467954 | -3.0 | 10.0 | 20.0 | 30.0 | 43.0 |
| Income | 5000.0 | 73.774200 | 46.033729 | 8.0 | 39.0 | 64.0 | 98.0 | 224.0 |
| Family | 5000.0 | 2.396400 | 1.147663 | 1.0 | 1.0 | 2.0 | 3.0 | 4.0 |
| CCAvg | 5000.0 | 1.937938 | 1.747659 | 0.0 | 0.7 | 1.5 | 2.5 | 10.0 |
| Education | 5000.0 | 1.881000 | 0.839869 | 1.0 | 1.0 | 2.0 | 3.0 | 3.0 |
| Mortgage | 5000.0 | 56.498800 | 101.713802 | 0.0 | 0.0 | 0.0 | 101.0 | 635.0 |
| Personal Loan | 5000.0 | 0.096000 | 0.294621 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| Securities Account | 5000.0 | 0.104400 | 0.305809 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| CD_Account | 5000.0 | 0.060400 | 0.238250 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| Online | 5000.0 | 0.596800 | 0.490589 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| Credit_Card | 5000.0 | 0.294000 | 0.455637 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
fig=px.box(df,y=['Age', 'Experience', 'Income', 'Family', 'Education'])
fig.show()
We can see the min, max, mean, and standard deviation for all key attributes of the dataset. "Income" contains many outliers and is slightly right-skewed, while "Age" and "Experience" are roughly symmetrically distributed.
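The box plot flags points beyond the whiskers as outliers. As a rough numerical check, we can count them with the standard 1.5 × IQR rule; this is a minimal sketch on a small synthetic "Income"-like series, since the exact count depends on the real data:

```python
import pandas as pd

def iqr_outlier_count(s: pd.Series) -> int:
    """Count points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return int(((s < lo) | (s > hi)).sum())

# Synthetic right-skewed income values for illustration only
income = pd.Series([8, 20, 39, 45, 60, 64, 70, 98, 150, 224])
print(iqr_outlier_count(income))  # → 1 (only 224 lies beyond the upper fence)
```

The same function could be applied to `df['Income']` to count the points the box plot marks as outliers.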
Let's check skewness by numbers:
df.skew()
Age                  -0.029341
Experience           -0.026325
Income                0.841339
Family                0.155221
CCAvg                 1.598443
Education             0.227093
Mortgage              2.104002
Personal Loan         2.743607
Securities Account    2.588268
CD_Account            3.691714
Online               -0.394785
Credit_Card           0.904589
dtype: float64
Now we will visualize the Skewness with graphs as below:
df.hist(figsize=(20,20))
array([[<AxesSubplot:title={'center':'Age'}>,
<AxesSubplot:title={'center':'Experience'}>,
<AxesSubplot:title={'center':'Income'}>],
[<AxesSubplot:title={'center':'Family'}>,
<AxesSubplot:title={'center':'CCAvg'}>,
<AxesSubplot:title={'center':'Education'}>],
[<AxesSubplot:title={'center':'Mortgage'}>,
<AxesSubplot:title={'center':'Personal Loan'}>,
<AxesSubplot:title={'center':'Securities Account'}>],
[<AxesSubplot:title={'center':'CD_Account'}>,
<AxesSubplot:title={'center':'Online'}>,
<AxesSubplot:title={'center':'Credit_Card'}>]], dtype=object)
Now we will visualize "Experience" with a distribution plot. Since experience cannot be negative, if the Experience variable contains any negative values, we will replace them with the mean of Experience for a cleaner analysis.
plt.figure(figsize=(12,6))
sns.distplot(df['Experience'])
<AxesSubplot:xlabel='Experience', ylabel='Density'>
Based on the graph, we can see that we have negative values in our data set under the variable "Experience". Now, we will deal with these negative values.
Let's first get our mean of "Experience":
df['Experience'].mean()
20.1046
Let's store the rows with negative experience values in a separate dataframe:
N_values = df[df['Experience']<0]
N_values.head()
| Age | Experience | Income | Family | CCAvg | Education | Mortgage | Personal Loan | Securities Account | CD_Account | Online | Credit_Card | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 89 | 25 | -1 | 113 | 4 | 2.30 | 3 | 0 | 0 | 0 | 0 | 0 | 1 |
| 226 | 24 | -1 | 39 | 2 | 1.70 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 315 | 24 | -2 | 51 | 3 | 0.30 | 3 | 0 | 0 | 0 | 0 | 1 | 0 |
| 451 | 28 | -2 | 48 | 2 | 1.75 | 3 | 89 | 0 | 0 | 0 | 1 | 0 |
| 524 | 24 | -1 | 75 | 4 | 0.20 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
Let's also check how many negative experience values our data has:
N_values.size
624
`N_values.size` returns 624, which counts cells rather than rows: 624 cells across 12 columns means 52 customers have negative experience values. These invalid values can easily distort our analysis, so for accuracy we will replace them with the mean of Experience:
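Note that pandas' `.size` counts cells (rows × columns), not rows; `len()` counts rows. A minimal sketch of the difference, on a tiny synthetic frame:

```python
import pandas as pd

# Toy frame standing in for the real dataset
df_demo = pd.DataFrame({
    'Experience': [-1, -2, 5, 10],
    'Income': [113, 48, 60, 75],
})
neg = df_demo[df_demo['Experience'] < 0]

print(neg.size)   # → 4  (2 rows * 2 columns = cells)
print(len(neg))   # → 2  (number of affected rows)
```

So on the real 12-column dataframe, a `.size` of 624 corresponds to 624 / 12 = 52 affected customers.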
new_df=df.copy()
new_df['Experience']=np.where(new_df['Experience']<0,new_df['Experience'].mean(),new_df['Experience'])
Now, let's make sure we actually replaced the negative values with the mean. To do so:
plt.figure(figsize=(12,6))
sns.distplot(new_df['Experience'])
<AxesSubplot:xlabel='Experience', ylabel='Density'>
new_df[new_df['Experience']<0]
(empty DataFrame — no rows returned)
As we can see, there are no negative values in the Experience variable anymore. Now we can move forward to the correlation analysis.
In this section, we will visualize the distribution of each variable except "Experience":
new_df.columns
Index(['Age', 'Experience', 'Income', 'Family', 'CCAvg', 'Education',
'Mortgage', 'Personal Loan', 'Securities Account', 'CD_Account',
'Online', 'Credit_Card'],
dtype='object')
col_names1 = ['Age', 'Income', 'Family', 'CCAvg', 'Education', 'Mortgage', 'Personal Loan', 'Securities Account', 'CD_Account',
'Online', 'Credit_Card']
# Please scroll down in cell to see all variables
for i in col_names1:
plt.figure(figsize=(12,5))
sns.distplot(df[i])
Since the variable "Experience" is important in this analysis, we will check its association with the other quantitative variables below:
experience_association1 = ['Age', 'Income', 'CCAvg', 'Mortgage']
experience_association2 = sns.PairGrid(df, y_vars = 'Experience', x_vars = experience_association1)
experience_association2.map(sns.regplot)
<seaborn.axisgrid.PairGrid at 0x2e265a19b20>
Based on the results above, we can claim that the "Age" and "Experience" variables are highly associated with each other.
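The near-linear relationship seen in the regression plot can also be quantified with a Pearson correlation coefficient. A sketch on synthetic customers (in the real dataset the Age–Experience correlation is likewise close to 1):

```python
import pandas as pd

# Synthetic customers: experience roughly tracks age minus schooling years
demo = pd.DataFrame({
    'Age':        [25, 30, 35, 40, 45, 50, 55, 60],
    'Experience': [ 1,  5, 10, 16, 20, 26, 30, 36],
})
r = demo['Age'].corr(demo['Experience'])  # Pearson correlation by default
print(round(r, 3))  # high positive correlation (close to 1)
```

On the real data, `new_df['Age'].corr(new_df['Experience'])` gives the value shown in the heatmap cell.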
In order to observe the correlations between variables in an advanced way, we will be using heatmap as below:
plt.figure(figsize=(15,6))
sns.heatmap(new_df.corr(),annot=True)
<AxesSubplot:>
We can see that "Experience" and "Age" are highly correlated, as stated above. Since our analysis will rely on "Age", we need to drop the "Experience" variable from the dataset to avoid a multicollinearity issue:
new_df=new_df.drop(['Experience'],axis=1)
new_df.head()
| Age | Income | Family | CCAvg | Education | Mortgage | Personal Loan | Securities Account | CD_Account | Online | Credit_Card | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | 49 | 4 | 1.6 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 45 | 34 | 3 | 1.5 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 39 | 11 | 1 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 35 | 100 | 1 | 2.7 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 35 | 45 | 4 | 1.0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 |
As a second check, let's analyze the education status of the customers who may apply for a loan at the bank. What are the unique values in the Education column?
new_df['Education'].unique()
array([1, 2, 3], dtype=int64)
The dataset encodes education as 1, 2, and 3; we will map these codes to the labels Undergraduate, Graduate, and Professional:
def update(x):
    if x==1:
        return 'Undergraduate'
    elif x==2:
        return 'Graduate'
    else:
        return 'Professional'
new_df['Education_Des'] = new_df['Education'].apply(update)
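An equivalent, arguably more idiomatic alternative to `apply` here is `Series.map` with a dictionary; a small sketch producing the same labels:

```python
import pandas as pd

edu = pd.Series([1, 2, 3, 1])  # toy education codes
labels = {1: 'Undergraduate', 2: 'Graduate', 3: 'Professional'}
mapped = edu.map(labels)
print(mapped.tolist())
# → ['Undergraduate', 'Graduate', 'Professional', 'Undergraduate']
```

On the real frame this would be `new_df['Education'].map(labels)`; `map` also surfaces unexpected codes as `NaN`, which `apply` with an `else` branch silently absorbs.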
new_df.head()
| Age | Income | Family | CCAvg | Education | Mortgage | Personal Loan | Securities Account | CD_Account | Online | Credit_Card | Education_Des | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | 49 | 4 | 1.6 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | Undergraduate |
| 1 | 45 | 34 | 3 | 1.5 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | Undergraduate |
| 2 | 39 | 11 | 1 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | Undergraduate |
| 3 | 35 | 100 | 1 | 2.7 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | Graduate |
| 4 | 35 | 45 | 4 | 1.0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | Graduate |
Now let's analyze how the customers who may apply for a loan are distributed across education levels, by grouping on Education_Des:
EDUs = new_df.groupby('Education_Des')['Age'].count()
EDUs
Education_Des
Graduate         1403
Professional     1501
Undergraduate    2096
Name: Age, dtype: int64
fig=px.pie(new_df,values=EDUs, names=EDUs.index,title='Education Distribution')
fig.show()
1- Undergraduate: 41.9%
2- Graduate: 28.1%
3- Professional: 30.0%
In our data, most people have an undergraduate level of education.
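The percentage breakdown above can be reproduced directly with `value_counts(normalize=True)` instead of a groupby plus pie chart; a sketch on synthetic labels:

```python
import pandas as pd

# Toy education labels: 6 Undergraduate, 2 Graduate, 2 Professional
edu = pd.Series(['Undergraduate'] * 6 + ['Graduate'] * 2 + ['Professional'] * 2)
shares = edu.value_counts(normalize=True).mul(100).round(1)
print(shares['Undergraduate'])  # → 60.0 (percent)
```

On the real data, `new_df['Education_Des'].value_counts(normalize=True)` would return the 41.9% / 28.1% / 30.0% split shown in the pie chart.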
px.box(new_df,x='Education',y='Income',facet_col='Personal Loan')
Let's also check how income is distributed between education groups:
f,ax = plt.subplots(1,1,figsize=(15,6))
ax = sns.barplot(x = new_df['Education'], y = 'Income',
data=new_df, palette='Spectral')
From the above plot we can say that the income of customers who availed a personal loan is almost the same irrespective of their education. Now, let's visualize the income distributions by loan status:
plt.figure(figsize=(15,8))
sns.distplot(new_df[new_df['Personal Loan']==0]['Income'],hist=False,label='Income with no personal loan')
sns.distplot(new_df[new_df['Personal Loan']==1]['Income'],hist=False,label='Income with personal loan')
plt.legend()
<matplotlib.legend.Legend at 0x2e265a3bb20>
As a result, customers who have availed a personal loan tend to have higher incomes than those who have not. This makes intuitive sense: people with higher incomes are better able to make the monthly payments. Let's visualize the binary variables separately below:
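The visual claim above can also be checked numerically with a groupby; a sketch on synthetic rows (the real figures would come from `new_df`):

```python
import pandas as pd

# Toy rows: three customers without a loan, three with
demo = pd.DataFrame({
    'Personal Loan': [0, 0, 0, 1, 1, 1],
    'Income':        [40, 55, 60, 120, 140, 160],
})
medians = demo.groupby('Personal Loan')['Income'].median()
print(medians)  # median income per loan-status group
```

Running `new_df.groupby('Personal Loan')['Income'].median()` on the real data would quantify the gap the two density curves show.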
new_df.columns
Index(['Age', 'Income', 'Family', 'CCAvg', 'Education', 'Mortgage',
'Personal Loan', 'Securities Account', 'CD_Account', 'Online',
'Credit_Card', 'Education_Des'],
dtype='object')
col_names=['Securities Account','Credit_Card', 'Family', 'CD_Account']
for i in col_names:
plt.figure(figsize=(10,5))
sns.countplot(x=i,hue='Personal Loan',data=new_df)
Securities Account: Customers who hold a securities account are more likely to take a loan, and most of the customers who do not have a loan do not have a securities account either.
Credit_Card: Customers who have a credit card and spend more than others on a monthly basis are more likely to take a loan.
Family: Family size does not appear to have any real effect on personal loan acceptance.
CD_Account: Customers who do not have a CD account mostly do not have a loan either.
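These category effects are easiest to read as acceptance rates rather than raw counts, using `pd.crosstab` with `normalize='index'`; a sketch on synthetic data:

```python
import pandas as pd

# Toy rows: four customers without a CD account, four with one
demo = pd.DataFrame({
    'CD_Account':    [0, 0, 0, 0, 1, 1, 1, 1],
    'Personal Loan': [0, 0, 0, 1, 0, 1, 1, 1],
})
# Each row of the table sums to 1: share of loan acceptors per CD_Account group
rates = pd.crosstab(demo['CD_Account'], demo['Personal Loan'], normalize='index')
print(rates)
```

Applying the same crosstab to `new_df` for each of the four variables would put exact acceptance rates behind the countplots above.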
As a result, the importance of data analysis cannot be ignored. Companies place more and more emphasis on data analysis in order to optimize their processes and maximize their profit. Through data analysis, Amazon today can make strategic offers to its customers, and its sales have grown as a result.
Another example: in 2018, the Houston Rockets, a National Basketball Association (NBA) team, raised their game using big data. The Rockets were one of four NBA teams to install a video tracking system that mined raw data from games. They analyzed the tracking data to investigate which plays provided the best scoring opportunities, and discovered something surprising: the best opportunities are dunks from inside the two-point zone and three-point shots from outside the three-point line, not long-range two-point shots from inside it. This discovery entirely changed the way the team approached each game, increasing the number of three-point shots attempted. In the 2017-18 season, the Rockets made more three-point shots than any other team in NBA history, and this was a major reason they won more games than any of their rivals.
Python and its packages are becoming more and more popular worldwide, and Python provides great functionality and flexibility for analyzing large datasets.